What's New in Aethos 2.0
Summary of new features in Aethos 2.0
- What is Aethos?
- Problems with Aethos 1.0
- What's new in Aethos 2.0
- Examples
- Modelling
- Model Analysis
- Feedback
In late 2019 I released Aethos 1.0, the first iteration of a package to automate common data science techniques. Since then I’ve received great feedback on how to improve Aethos, and this post introduces the resulting changes! It includes a lot of code examples to show the power and versatility of the package.
You can view the previous posts about Aethos on my blog!
For those new to Aethos, Aethos is a Python library of automated data science techniques and use cases from missing value imputation, NLP pre-processing, feature engineering, data visualization to modelling, model analysis and model deployment.
To see the full capabilities and the rest of the techniques and models you can run, check out the project page on GitHub!
A lot of the problems with the first version of Aethos were related to the usability of the package and its API. The major problems were:
- Slow import times due to the number of files and coupled packages.
- Having 2 objects for end to end analysis - Data for transformations and Model for modelling
- Model object had every model and was not specific to Supervised or Unsupervised problems.
- Unintuitive API calls for adding new columns to the underlying DataFrames
- Reporting feature was, well, garbage and becoming redundant with external tools like converting notebooks to pdfs.
- API had limited use cases. You couldn't just analyze your data, or just analyze a model you trained without Aethos.
- Aethos and Pandas were not interchangeable and did not work together when transforming data.
Aethos 2.0 looks to address the intuitiveness and usability of the package to make it easier to use and understand. It also makes it possible to work with Pandas DataFrames side by side with Aethos.
- Reduced import time of the package by simplifying and decoupling of the Aethos modules.
- Only 1 object to analyze, visualize, transform, model and analyze results.
- Can now specify the type of problem - Classification, Regression or Unsupervised and only see the models specific to those problems.
- Removed the complexity of adding data to the underlying dataframes through Aethos objects. You can access the underlying dataframes with the x_train and x_test properties.
- Removed reporting feature.
- Introduced new objects to support new cases:
  - Analysis: To analyze, visualize and run statistical analysis (t-test, anova, etc.) on your data.
  - Classification: To analyze, visualize, run statistical analysis, transform and impute your data to run classification models.
  - Regression: To analyze, visualize, run statistical analysis, transform and impute your data to run regression models.
  - Unsupervised: To analyze, visualize, run statistical analysis, transform and impute your data to run unsupervised models.
  - ClassificationModelAnalysis: Interpret, analyze and visualize classification model results.
  - RegressionModelAnalysis: Interpret, analyze and visualize regression model results.
  - UnsupervisedModelAnalysis: Interpret, analyze and visualize unsupervised model results.
  - TextModelAnalysis: Interpret, analyze and visualize text model results.
- Removed dot notation when accessing DataFrame columns.
- Can now chain methods together.
!pip install aethos
import pandas as pd
import aethos as at
at.options.track_experiments = True # Enable experiment tracking with MLflow
To showcase each of the objects let's load in the Titanic dataset.
orig_data = pd.read_csv('https://raw.githubusercontent.com/Ashton-Sidhu/aethos/develop/examples/data/train.csv')
orig_data.describe()
The Analysis object is mainly for quick, easy analysis and visualization of data. It doesn't have the automated cleaning and transformation techniques of Aethos, just visualizations and statistical tests. It also does not split your data, but you do have the option to provide a test set.
df = at.Analysis(orig_data, target='Survived')
df.describe()
df.missing_values
df.column_info()
df.standardize_column_names()
df.describe_column('fare')
df.data_report()
Easily view the histogram of multiple features.
df.histogram('age', 'fare', hue='survived')
Create a configurable correlation matrix.
df.correlation_matrix(data_labels=True, hide_mirror=True)
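If you're curious what hiding the mirror of a correlation matrix amounts to, here is a minimal pandas/NumPy sketch on made-up numbers. It is not Aethos's implementation; the toy columns and the choice to mask the upper triangle are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric data (made up for illustration, not the Titanic dataset).
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
    "pclass": [3, 1, 3, 1, 1, 3],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

# A correlation matrix is symmetric, so the upper triangle repeats the lower one.
# Mask the strict upper triangle (k=1 leaves the diagonal visible).
mirror = np.triu(np.ones(corr.shape, dtype=bool), k=1)
half = corr.mask(mirror)

print(half.round(2))
```

The masked cells come back as NaN, which most plotting libraries simply leave blank.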
We can easily plot the average fare paid by passengers of each age.
df.barplot(x='age', y='fare', method='mean', labels={'age': 'Age', 'fare': 'Fare'}, asc=False)
We can also easily view the relationship between age and fare and see the difference between those who survived and those who didn't.
df.scatterplot(x='age', y='fare', color='survived', labels={'age': 'Age', 'fare': 'Fare'}, marginal_x='histogram', marginal_y='histogram')
You can visualize other plots like raincloud, violin, box, pairwise, etc. I recommend checking out the examples for more!
One of the big changes is the ability to work with pandas side by side. If you want to transform and work with data solely with Pandas, the Analysis object will reflect those changes. This allows you to use Aethos solely for automated analysis and Pandas for transformations.
To demonstrate this, we will make a new boolean feature to see if a passenger was a child using the original pandas dataframe we created.
orig_data['is_child'] = (orig_data['age'] < 18).astype(int)
orig_data.head()
Now let's see it in our Analysis object.
df.head()
df.boxplot(x='is_child', y='fare', color='survived')
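One way to picture why the new column shows up: if a wrapper keeps a reference to the DataFrame it was given instead of copying it, in-place changes are visible through both names. The Holder class below is hypothetical and only illustrates that idea; how Aethos stores its data internally is an assumption on my part here.

```python
import pandas as pd


class Holder:
    """Hypothetical wrapper that stores the DataFrame it is given without copying."""

    def __init__(self, data: pd.DataFrame):
        self.data = data  # same object, not a copy

    def head(self, n: int = 5) -> pd.DataFrame:
        return self.data.head(n)


raw = pd.DataFrame({"age": [4, 30, 17, 52]})
wrapped = Holder(raw)

# Add a column through plain pandas, after the wrapper was created.
raw["is_child"] = (raw["age"] < 18).astype(int)

print(wrapped.head())  # the new column is visible through the wrapper too
```

Had the wrapper stored data.copy() instead, the two would drift apart after the first pandas-side change.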
You can still run pandas functions on Aethos objects.
df.nunique()
df['age'].nunique()
Introduced in Aethos 2.0 are some new analytic techniques.
The predictive power score is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation matrix. Credits go to 8080Labs for creating this library and you can get more info here.
df.predictive_power(data_labels=True)
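To make "asymmetric" and "non-linear" concrete, here is a simplified sketch of a predictive-power-style score for a numeric target, written with scikit-learn. It is not the 8080Labs implementation: the simple_pps function, the 4-fold setup and the toy data are my own assumptions. The idea it follows is to compare a decision tree's cross-validated error against a naive median baseline.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def simple_pps(feature: np.ndarray, target: np.ndarray) -> float:
    """Simplified predictive-power-style score for a numeric target.

    Compares the cross-validated MAE of a decision tree against a naive
    baseline that always predicts the median, then clips the result at 0.
    """
    cv = KFold(n_splits=4, shuffle=True, random_state=0)
    preds = cross_val_predict(
        DecisionTreeRegressor(random_state=0), feature.reshape(-1, 1), target, cv=cv
    )
    mae_model = np.mean(np.abs(target - preds))
    mae_naive = np.mean(np.abs(target - np.median(target)))
    return float(max(0.0, 1.0 - mae_model / mae_naive))


x = np.linspace(-2, 2, 401)
y = x ** 2  # non-linear and symmetric around zero

pearson_r = float(np.corrcoef(x, y)[0, 1])
pps_xy = simple_pps(x, y)  # how well x predicts y
pps_yx = simple_pps(y, x)  # how well y predicts x

print("Pearson r:", round(pearson_r, 3))
print("x -> y:", round(pps_xy, 3))
print("y -> x:", round(pps_yx, 3))
```

Here the correlation is essentially zero even though x determines y completely, while knowing y says very little about the sign of x, which is the asymmetry the score is meant to capture.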
AutoViz auto visualizes your data and displays key plots based off the characteristics of your data. Credits go to AutoViML for creating this library and you can get more info here.
df.autoviz()
Aethos 2.0 introduces 3 new model objects: Classification, Regression and Unsupervised. These objects have the same capabilities as the Analysis object, but can also transform your data the same way Aethos 1.0 did. For those new to Aethos, whenever you use Aethos to apply a transformation, it fits it to the training data and applies it to both the training and test data (in the case of Classification and Regression) to avoid data leakage.
In this post we'll cover the Classification object but the process is exactly the same if you were working with a Regression or Unsupervised problem.
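For anyone who hasn't seen the fit-on-train, apply-to-both pattern written out, here is a minimal scikit-learn sketch with made-up ages. It shows the general idea only; it is not a claim about how Aethos implements its transformations.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with missing values in both splits.
train = pd.DataFrame({"age": [22.0, None, 35.0, 54.0]})
test = pd.DataFrame({"age": [None, 80.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train[["age"]])  # the median is learned from the training rows only

# The same learned median fills the gaps in both splits.
train["age"] = imputer.transform(train[["age"]]).ravel()
test["age"] = imputer.transform(test[["age"]]).ravel()

print(imputer.statistics_)
print(test)
```

Because the median never sees the test rows, nothing about the test set leaks into the transformation.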
df = at.Classification(orig_data, target='Survived', test_split_percentage=.25)
As with Aethos 1.0, if no test data is provided, the data is split upon initialization. In Aethos 2.0 it uses stratification for classification problems to split the data to ensure some semblance of class balance.
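As a rough illustration of what a stratified 75/25 split does, here is a scikit-learn sketch on a toy imbalanced target. The data is made up, and using train_test_split is my stand-in for the idea rather than a statement about Aethos internals.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 60 zeros and 20 ones (made up for illustration).
data = pd.DataFrame({
    "feature": range(80),
    "survived": [0] * 60 + [1] * 20,
})

train, test = train_test_split(
    data,
    test_size=0.25,
    stratify=data["survived"],  # keep the class ratio in both splits
    random_state=42,
)

print(train["survived"].mean(), test["survived"].mean())
```

Both splits keep the same share of positive rows as the full dataset, which a purely random split does not guarantee on small or imbalanced data.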
df.describe()
df.x_train.head()
df.x_test.head()
df.missing_values
df.checklist()
df.standardize_column_names()
Since this is an overview, let's select the columns we're going to work with and drop the ones we're not going to use.
df.drop(keep=['survived', 'pclass', 'sex', 'age', 'fare', 'embarked'])
Let's chain our transformations together. Remember our transformations will be fit to the training data and automatically transform our test data!
is_child = lambda df: 1 if df['age'] < 18 else 0
df.replace_missing_median('age') \
.replace_missing_mostcommon('embarked') \
.onehot_encode('sex', 'pclass', 'embarked', keep_col=False) \
.apply(is_child, 'is_child') \
.normalize_numeric('fare', 'age')
df.x_train.head()
df.x_test.head()
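The is_child function above takes a single row, so a plain pandas equivalent would be a row-wise apply. The sketch below assumes that row-wise behaviour (I'm not asserting it is what Aethos does internally) and shows the vectorized alternative next to it, on made-up ages.

```python
import pandas as pd

passengers = pd.DataFrame({"age": [4.0, 30.0, 17.0, 52.0]})

# Row-wise: the function receives one row at a time (axis=1).
is_child = lambda row: 1 if row["age"] < 18 else 0
passengers["is_child"] = passengers.apply(is_child, axis=1)

# Vectorized equivalent, usually much faster on large frames.
vectorized = (passengers["age"] < 18).astype(int)

print(passengers)
```

Note that the comparison has to happen before any scaling of age, otherwise the threshold of 18 no longer means anything.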
Now let's train a Logistic Regression model.
We'll use grid search, which automatically returns the best model, with Stratified K-Fold as the cross validation technique during the search.
gs_params = {
"C": [0.1, 0.5, 1],
"max_iter": [100, 1000]
}
lr = df.LogisticRegression(
cv_type='strat-kfold',
gridsearch=gs_params,
random_state=42
)
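For comparison, a roughly equivalent search in plain scikit-learn might look like the sketch below. The synthetic data, the 5 folds and the accuracy scoring are my assumptions; the parameter grid is the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

gs_params = {"C": [0.1, 0.5, 1], "max_iter": [100, 1000]}

search = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid=gs_params,
    cv=StratifiedKFold(n_splits=5),  # fold count is an assumption
    scoring="accuracy",              # scoring choice is an assumption
)
search.fit(X, y)

# With the default refit=True, the best parameter set is refit on all of X, y.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Every combination in the grid (3 values of C times 2 values of max_iter) is scored on each fold, and the combination with the best mean score wins.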
Once a model is trained a ModelAnalysis object is returned which allows us to analyze, interpret and visualize our model results. Included is a list to help you debug your model if it’s overfit or underfit!
df.help_debug()
You can quickly cross validate any model by calling cross_validate on the resulting ModelAnalysis object. It will display the mean score across all folds and a learning curve.
For classification problems the default cross validation method is Stratified K-Fold, which helps maintain some form of class balance in each fold. For regression, the default is K-Fold.
lr.cross_validate()
lr.metrics() # Note this displays the results on the test data.
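If you want to see the numbers behind a mean cross validation score and a learning curve, here is a sketch using scikit-learn directly. The synthetic data, fold count and training sizes are assumptions, and the curve is printed rather than plotted to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, learning_curve

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold, then the mean across folds.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("mean accuracy across folds:", round(float(scores.mean()), 3))

# Train/validation accuracy at increasing training set sizes.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=cv, train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy"
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(float(tr), 3), round(float(va), 3))
```

A large, persistent gap between the training and validation columns points toward overfitting; two low, close curves point toward underfitting.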
Let's manually train a Logistic Regression and view and verify the results.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score
X_train = df.x_train.drop("survived", axis=1)
X_test = df.x_test.drop("survived", axis=1)
y_train = df.x_train["survived"]
y_test = df.x_test["survived"]
clf = LogisticRegression(C=1, max_iter=100, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
# round() works whether the metric comes back as a Python float or a NumPy float
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 3)}")
print(f"AUC: {round(roc_auc_score(y_test, clf.decision_function(X_test)), 3)}")
print(f"Precision: {round(precision_score(y_test, y_pred), 3)}")
Results are the same!
Similar to Modelling, Aethos 2.0 introduces 4 model analysis objects: ClassificationModelAnalysis, RegressionModelAnalysis, UnsupervisedModelAnalysis and TextModelAnalysis. In Aethos 2.0 they can be initialized in 2 ways:
- Result of training a model using Aethos
- Initializing it on your own by providing a Model object, the train data used by the model and the test data to evaluate model performance (for Regression and Classification).
Similar to the Model objects, we're going to explore the ClassificationModelAnalysis object, but the process would be the same for regression, unsupervised and text model analysis.
To start, we'll pick up where we left off with modelling and view the metrics for our Logistic Regression model.
type(lr)
lr.metrics()
You can also set project metrics based off your business requirements.
at.options.project_metrics = ["Accuracy", "ROC AUC", "Precision"]
lr.metrics()
If you just want to view individual metrics, there are functions for those too!
lr.fbeta(beta=0.4999)
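As a quick refresher on what the beta parameter does, the F-beta score is (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall, so a beta below 1 weights precision more heavily. The sketch below checks that formula against scikit-learn on made-up labels; it uses beta=0.5 for readability rather than the 0.4999 above.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

beta = 0.5
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F-beta computed by hand from precision and recall.
manual = (1 + beta**2) * p * r / (beta**2 * p + r)
library = fbeta_score(y_true, y_pred, beta=beta)

print(round(manual, 4), round(library, 4))
```

With precision and recall both at 0.75 here, the score is 0.75 regardless of beta; the weighting only matters once the two diverge.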
You can analyze any model's results with just one line of code:
- Metrics
- Classification Report
- Confusion Matrix
- Decision Boundaries
- Decision Plots
- Dependence Plots
- Force Plots
- LIME Plots
- Morris Sensitivity
- Model Weights
- Summary Plot
- ROC Curve
- Individual metrics
And this is only for Classification models; each type of problem has its own set of ModelAnalysis functions!
lr.classification_report()
lr.confusion_matrix()
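If you'd like to see the raw numbers behind these two views, here is a small scikit-learn sketch on made-up labels. It only illustrates how the matrix is laid out; the Aethos versions add their own formatting and plots on top.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for a binary problem.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(classification_report(y_true, y_pred))
```

For these labels the matrix is [[2, 1], [1, 2]]: two true negatives, one false positive, one false negative and two true positives.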
You can supply features from your train set to the decision boundary plot; otherwise it will just use the first 2 features in your model. Under the hood it uses Yellowbrick's decision boundary visualizer to create the visualizations.
lr.decision_boundary('age', 'fare')
lr.decision_boundary()
Included are also automated SHAP use cases to interpret your model!
lr.decision_plot()
lr.dependence_plot('age')
lr.force_plot()
lr.interpret_model()
View the highest weighted features in your model.
lr.model_weights()
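For a linear model like Logistic Regression, one common reading of "highest weighted features" is the coefficients ranked by absolute value. The sketch below shows that reading with scikit-learn on synthetic data; the placeholder feature names and the ranking rule are assumptions, not a description of what model_weights does internally.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
features = ["f0", "f1", "f2", "f3"]  # placeholder feature names

clf = LogisticRegression().fit(X, y)

# One coefficient per feature for a binary problem; rank by magnitude.
weights = pd.Series(clf.coef_[0], index=features)
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)

print(ranked)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, which is one more reason to normalize before training.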
Easily plot an ROC curve.
lr.roc_curve()
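The points on an ROC curve are just (false positive rate, true positive rate) pairs at different score thresholds. Here is a tiny scikit-learn sketch on made-up scores that computes those points and the area under the curve, without any plotting.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (fpr, tpr) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

print(list(zip(fpr, tpr)))
print("AUC:", auc)
```

The AUC here is 0.75: of the four possible (positive, negative) pairs, three are ranked in the right order by the scores.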
lr.summary_plot()
Finally we can generate the files to deploy our model through a RESTful API using FastAPI, Gunicorn and Docker!
lr.to_service('aethos2')
If we manually trained a model like we did earlier in the notebook and wanted to use Aethos's model analysis capabilities, we can!
lr = at.ClassificationModelAnalysis(
clf,
df.x_train,
df.x_test,
target='survived',
model_name='log_reg'
)
You will receive the same results as above, thus giving you the ability to manually transform your data, train your model and use Aethos to interpret the results. I've included them below for verification.
lr.metrics()
lr.decision_boundary('age', 'fare')
lr.decision_boundary()
lr.decision_plot()
lr.dependence_plot('age')
lr.force_plot()
lr.interpret_model()
lr.model_weights()
lr.roc_curve()
lr.summary_plot()
lr.to_service('aethos2')
I encourage all feedback about this post or Aethos. You can message me on Twitter or e-mail me at sidhuashton@gmail.com.
For bug reports or feature requests, please create an issue on the GitHub repo. I welcome all feature requests and any contributions. This project is a great starter if you’re looking to contribute to an open source project, and you can always message me if you need assistance getting started.